Script Overview

This analysis derives a transcriptomic signature that predicts survival for renal carcinoma patients. Using transcriptomic data from NIH Genomic Data Commons (GDC), abnormally expressed genes are derived via differential expression analysis (DEA) of tumor vs. normal associated tissue (NAT). The resulting deferentially expressed genes (DEG) where assessed using a Kaplan-Meier survival analysis & Cox Proportional Hazards (COXPH) model to determine which genes best predict patient outcome. The resulting prognostic signature shows robust predictive value of appx. AUC 0.8 & includes a list of genes with well-known associations to tumor progression, including NF𝛋B, DNA repair, & HIF-1⍺ signaling pathways.



Acronyms

DEA - Differential expression analysis

DEG - Differentially expressed genes

GDC - Genomic Data Commons (Nation Cancer Institute)

NAT - Normal associated tissue (GDC calls this “Solid tissue normal”)

KIRC - Kidney renal clear cell carcinoma

PCA - Principal components analysis

FDR - False discovery rate (adjusted p value)

KM - Kaplan-Meier Analysis

COXPH - Cox Proportional-Hazards model

AUC - Area under the curve; a metric of sensitivity & specificity of a predictive model.



Instructions

  1. OPTIONAL: Make sure GDC client is installed on your machine. Default setting uses API. https://gdc.cancer.gov/access-data/gdc-data-transfer-tool

  2. OPTIONAL: Input a file path for your downloaded transcriptomic data in the first code chunk below. Total downloaded data are 2.5GB as of 5/16/22.

  3. Run this script (click “Knit” above) (takes ~15min to run on my laptop). If you encounter problems with GDC download, update the TCGAbiolinks package and other BioC packages, as described in “Preliminaries” Section below.



Analysis Outline



Stage 1 - Assess data structure & design Limma model

1.1) GDC is queried for human Kidney Renal Clear Cell Carcinoma (KIRC) bulk RNA sequencing of tumor and normal associated tissue (NAT) samples.

Source: https://portal.gdc.cancer.gov/projects/TCGA-KIRC

1.2) Clinical metadata are profiled to look for confounders that need to be corrected for in the limma model design. For example, tumor and NAT samples showed differences in their tumor stage profile, meaning tumor stage needed to be corrected for in differential expression analysis comparing tumor & NAT.

1.3) Library sizes were plotted to determine whether VOOM is needed. VOOM is typically used for DEA when library size variance is high. Library size variance is calculated (e.g. determine whether to use VOOM). This is done by first setting up a preliminary design matrix, normalizing the data, and plotting the library distributions & normalization correction factors.

Stage 2 - Determine which genes show abnormal expression in cancer. (DEA; tumor vs NAT)

2.1) Re-run GDC query with a filtered samples list (some samples with small lib size were removed)

2.2) Limma/VOOM were used to obtain DEG (NAT vs. tumor)

2.3) Determine whether tumor stratification is needed. Asking which genes significantly predict patient outcomes should take into consideration whether there are multiple distinct tumor phenotypes. E.g. if there are two+ tumor sub-types, it may be appropriate to create two+ survival model for each subtype. Correlation heat mapping and PCA were used to visualize the number of major sample populations & and assess whether tumor sample stratification might be appropriate (no clear basis for stratification was observed).

Stage 3 - Derivation of a prognostic signature using predictive survival modeling (Kaplan Meier & Cox Prop. Hazards)

3.1) Create a survival model using the top abnormally expressed genes (DEG from NAT vs. tumor DEA), were used to determine which genes would predict survival, yielding a prognostic gene signature.

3.2) Create a Cox Proportional Hazards Model using the prognostic genes identified in the Kaplan-Meier analysis. The primary COXPH output is an AUC value over time as an assessment of the overall predictive value of the prognostic signature genes.



Preliminaries - Set target directory

Get Started:

Input the file path to which you want to download the GDC data. GDC KIRC transcriptomic counts will be directed to this folder. Once the directory has been set, run all code: (click “Run” or “Knit” above).

library(rstudioapi)
setwd(dirname(rstudioapi::getActiveDocumentContext()$path)) # setwd to source file location
dir.create("GDCdata")
## Warning in dir.create("GDCdata"): 'GDCdata' already exists
target_directory <- file.path(getwd(), "GDCdata")   # Optional: change target path for GDC data (~2.5GB)

library(knitr)
knitr::opts_chunk$set(error = FALSE, cache = TRUE, message = FALSE, warning = FALSE, echo = FALSE, include = TRUE, dpi = 300)

Even if you have the following BioC packages installed, consider updating them, as GDC updates their packages frequently.



Stage 1 - Assess data structure and plan Limma design matrix



1.1 GDC query

GDC KIRC RNAseq library counts prior to filtering
sample_type n
Primary Tumor 540
Solid Tissue Normal 72



1.2 Plan DEA based on clinical data

Info on GDC metadata: https://docs.gdc.cancer.gov/Data_Dictionary/viewer/#?view=table-definition-view&id=demographic&anchor=age_at_index

In preparation for the limma differential expression analysis (DEA) comparing tumor and normal associated tissue (NAT), clinical metadata is examined here to ask whether several clinical factors may confound the DEA. For example, is the tumor stage profile or the proportion of males different between tumor & NAT? Such clinical variables need to be corrected for statistically in the limma model.

– Plotting continuous variables –

Assessment: age_at_index, age_at_diagnosis, and tumor size (longest_dimension) are not significantly different between primary solid tumor and solid tissue normal (NAT), & are therefore not incorporated into the limma design matrix.



Examine metadata for sample variables that would confound a comparison of tumor & NAT.

– Plotting discrete variables –



Assessment: ajcc_pathologic_stage is signficantly different between tumor & NAT, and should be corrected for in the final limma model. Difference in sex-ratio between tumor vs. NAT will also be corrected for (even though chi-sq p-val showed differences are not statistically significant), due to male-female genetic distance.



Create preliminary design matrix comparing sample groups “Solid Tissue Normal” (NAT) to “Primary solid Tumor”

Create DGEList object (S4) that contains counts, norm counts & library sizes

Remove low expressed genes

## Formal class 'DGEList' [package "edgeR"] with 1 slot
##   ..@ .Data:List of 3
##   .. ..$ : int [1:28142, 1:612] 3195 38 1122 490 201 1287 1706 2621 1785 1251 ...
##   .. .. ..- attr(*, "dimnames")=List of 2
##   .. ..$ :'data.frame':  612 obs. of  72 variables:
##   .. ..$ :'data.frame':  28142 obs. of  10 variables:
##   ..$ names: chr [1:3] "counts" "samples" "genes"



1.3 Library QC & filtering

Here, sample library sizes, normalization factors, and expression distributions are visualized to look for any data abnormalities (i.e. samples w/ poor sequencing or biological coverage). The red dashed line shows the library filtering threshold. The removal of small libraries eliminates ffpe samples, and also eliminates samples with less data.

Visualizing these data also informs the ‘Analysis Stage 2’ decision to use VOOM for the differential expression analysis, which is useful were library size variance is high.

Scale normalization of the RNAseq read counts using TMM (trimmed mean of M component) normalization method.

Plot library normalization factors

Plot library expression distributions






Stage 2 - Differential expression analysis (tumor vs NAT)

Re-do GDC query, preparation, & normalization using a new filtered sample list (samples with larger library sizes). Then use a Kaplan-Meier analysis and COXPH model to find genes the predict survival.

So far…

A GDC query and download was done to visualize the metadata, identifing potential confounders that should be corrected for in the final voom/limma design matrix below. The library sizes and normalization factors were plotted to find outliers, and a sample list excluding very small libraries was created.

Next…

The GDC query will be set up again using the filtered sample list. Design matrix and normalization factors are computed again to ensure compatibility with the new query.



2.1 GDC query of QC’ed library list



2.2 Limma/VOOM DEA

Voom does a transformation & differential expression calculation. VOOM was chosen over limma-trend because the library sizes are variable between samples (>3-fold differences). The VOOM transformation is applied to the normalized & filtered DGEList object ‘dge’.

In the voom plot below, each point is one gene, with x axis showing the avg expression in log2 cpm reads and y axis is the square-root of the residual stdev of expression in log2 counts per million reads.

Voom transformation uses the experiment design matrix, and produces an EList (S4) object

After VOOM, the usual LIMMA pipelines can be run, starting w fitting a model

Derive & plot top significantly (BH FDR10) differentially expressed genes (DEG) from the limma model.

## .
## non-signif.     signif. 
##       13735       14060
## .
## non-signif.     signif. 
##       24165        3630





Tabulate n values for Limma differential expression analysis above.

No. RNAseq samples after filtering.
sample_type n
Primary Tumor 511
Solid Tissue Normal 72



Plot top 20 DEG (derived comparing tumor vs. NAT):



2.3 Sample stratification needed?

A useful prognostic signature will ideally mean a single signature that predicts survival for all or most cases of a defined indication such as KIRC. For clinicians, it is generally impractical to determine which phenotypic subcluster a tumor transcriptome belongs to and then choose a prognostic signature on the basis of such a classification.

Nonetheless, a single “one-size-fits-all” prognostic signature may not be possible if there are multiple, distinct tumor sub-populations within the KIRC classification. To explore this possibility, tumor sample relationships are visualized here to determine whether multiple prognostic signatures may be required.

Assessment of PCA plots below: Although there are clearly two populations–NAT and tumor–the tumor samples do not clearly require stratification. There is one, albeit phenotypically diverse tumor population.

The PCA plots above & scree plot below show that PC1 represents the major component of difference between healthy and tumor tissues. PC’s 1 thru 4 do not show distinct sub-populations within the KIRC tumor samples, although there are a few phenotypic outliers. Based on this observation, a single prognostic signature for all tumor samples will be derived and tested for predictive power below.

Use a Spearman correlation heat map of the VOOM-normalized series matrix to look at the prevalence of outliers and the number of tumor sample populations. If multiple, clearly-distinct tumor populations are present, sample stratification may be needed (i.e. comparing NAT to each stratified tumor population).

A survival model will be constructed below to see whether a reliable prognostic signature can be derived from an un-stratified set of tumor samples.






Stage 3 - Predictive modeling

3.1 Kaplan-Meier analysis

Abnormally expressed DEG (found by comparing tumor vs. NAT), were used here to assess their predict value toward overall survival. The response variable combines “days_to_last_follow_up” & “days_to_death”, using “vital_status” (death) as a censored variable. The output is a list of genes with prognostic value, and these are used in the subsequent coxph model.

KM Analysis input are tumor samples only, examining genes that are a) in the top 10% of genes terms of high expression variance b) are significantly differentially expressed between tumor and NAT (Benjamini-Hochberg FDR10).

Data filtered prior to KM Analysis:

  • DEG < FDR10

  • Top 10% of variance

  • Tumor samples only

The KM analysis above generates a list of genes was found that significantly predicted survival (BH FDR10) . However, p-values have limited utility as sample numbers increase, so this list is refined below include samples with high fold changes (cutoff was 2^0.3219281 = 25% change).

Kaplan-Meier Analysis Genes Predicting survival
ENSG pval FDR FC_log2
ENSG00000073150.14 0.0000 0.0000000 0.3261780
ENSG00000104313.20 0.0000 0.0000000 0.3308153
ENSG00000119547.6 0.0000 0.0000000 0.3637163
ENSG00000123838.11 0.0000 0.0000000 0.3906049
ENSG00000155269.12 0.0000 0.0000000 0.3768961
ENSG00000179698.14 0.0000 0.0000000 0.4308977
ENSG00000181449.4 0.0000 0.0000000 0.3269393
ENSG00000184363.10 0.0000 0.0000000 0.3531697
ENSG00000198768.11 0.0000 0.0000000 0.3315854
ENSG00000225783.8 0.0000 0.0000000 0.3728919
ENSG00000231748.1 0.0000 0.0000000 0.3495725
ENSG00000236453.5 0.0000 0.0000000 0.3565338
ENSG00000236499.2 0.0000 0.0000000 0.3235660
ENSG00000236591.1 0.0000 0.0000000 0.3382766
ENSG00000267121.6 0.0000 0.0000000 0.4104828
ENSG00000268649.5 0.0000 0.0000000 0.3858131
ENSG00000154277.13 0.0001 0.0008323 0.3575658
ENSG00000175265.17 0.0001 0.0008323 0.3905267
ENSG00000197599.12 0.0001 0.0008323 0.3954686
ENSG00000106327.13 0.0002 0.0014442 0.3486972
ENSG00000169885.10 0.0002 0.0014442 0.4053682
ENSG00000225217.1 0.0002 0.0014442 0.3763401
ENSG00000228727.9 0.0002 0.0014442 0.3584158
ENSG00000229759.1 0.0003 0.0020096 0.3664934
ENSG00000263345.1 0.0003 0.0020096 0.3411169
ENSG00000213658.12 0.0004 0.0024547 0.3672919
ENSG00000236017.8 0.0004 0.0024547 0.3358611
ENSG00000100288.20 0.0005 0.0028958 0.3754193
ENSG00000250067.12 0.0007 0.0037640 0.3474407
ENSG00000253317.1 0.0009 0.0046162 0.3347877
ENSG00000271959.1 0.0009 0.0046162 0.3274670
ENSG00000234537.1 0.0013 0.0061778 0.3463489
ENSG00000205702.11 0.0020 0.0086068 0.3995944
ENSG00000236871.7 0.0034 0.0130733 0.3338875
ENSG00000287089.1 0.0039 0.0145141 0.3581816
ENSG00000271821.1 0.0048 0.0172850 0.3544788
ENSG00000277883.1 0.0137 0.0393044 0.3507425
ENSG00000241769.7 0.0188 0.0505455 0.3679429
ENSG00000284874.1 0.0239 0.0607887 0.3826835



3.2 Cox Proportional Hazards Model & AUC assessment

In order to further refine the prognostic gene list, and to generate an AUC for the genes that predict survival according to the Kaplan-Meier Analysis, a COXPH model us used here. The genes with the most significant hazard ratios are used to calculate an AUC over time.

Input the Kaplan-Meier genes into a Cox Proportional Hazards (COXPH) model, in order to visualize their hazard ratios and p values in a volcano plot for the entire data set.

Determine AUC using genes with absolute hazard ratio >=10%

https://datascienceplus.com/time-dependent-roc-for-survival-prediction-models-in-r/

Derive an AUC from only significant genes (from COXPH model for all data). One of the 7 genes, IP6K3, is removed here as it is does not have significant predictive power in the more targeted model (6 genes) below. Above, the top genes are filtered above are further refined to a smaller panel that would be feasible in a clinical setting (i.e. 5-10 gene transcripts).

Testing and plotting a AUC over time for the top 7 prognostic genes.

Tabulate coefficients for the 7-gene COXPH model

Kaplan-Meier univariate survival curves showing the top 7 predictive genes from the COXPH model above.

3.3 Incorporate new variables into the COXPH model: age_at_index, gender, & ajcc_pathologic stage into the model

Input here into the new cox model are the top genes from the KM analysis above, but this time, incorporating the three clinical vars to see if they enhance the model.

Tabulate coefficients for the 7-gene COXPH model along with age, gender, & ajcc pathologic stage.

Discussion

A prognostic signature of six transcripts was derived here. This signature predicts survival with the Chambless and Diao’s estimator of cumulative/dynamic AUC = 0.8 for the COXPH model. This predictive capacity that remains durable over 12 years after index (sample collection time). The addition of clinical variables to the model–including tumor stage & patient age at index–does add moderate predictive capacity to the model, but the advantage is only slight. These prognostic transcripts include transcripts known to contribute to tumor progression, including NF𝛋B signaling (LAT) , HIF-1⍺ signaling (ONECUT2), as well as inositol hexakisphosphate kinase 3 (IP6K3) & glucose-regulated protein 78 (GPR78), both of which have a previously defined negative prognostic value in renal cancer. Prognostic genes also included cytochrome P450, CYP2D7 consistent therapeutic resistance predicting survival. Expression of these genes provide valuable prognostic information for KIRC patients.




References:

ONECUT2 - Accelerates tumor progression in gastric, CRC, prostate, hepatocellular cancers. Androgen resistance and HIF1alpha signaling pathways.

https://www.nature.com/articles/s41467-018-08133-6

IP6K3 - prognostic in renal cancer

https://www.proteinatlas.org/ENSG00000161896-IP6K3

LAT - NFkB signalling; immune signalling; MHC and TCR signaling

https://www.proteinatlas.org/ENSG00000213658-LAT/pathology/renal+cancer

GPR78 - prognostic in renal cancer

see kaplan-meier plot https://www.proteinatlas.org/ENSG00000155269-GPR78/pathology/renal+cancer

LINC00896

https://pubmed.ncbi.nlm.nih.gov/34790382/

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8576217/

SAPCD1 - DNA mismatch repair; prognostic in cancer

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7391198/

CYP2D6 - drug resistance / detoxification https://www.frontiersin.org/articles/10.3389/fphar.2010.00121/full

Long-non-coding RNA’s show prognostic value in KIRC

https://www.frontiersin.org/article/10.3389/fonc.2020.01430